Approximate Abundance Histograms and Their Use for Genome Size Estimation

نویسندگان

  • Mário Lipovský
  • Tomáš Vinař
  • Broňa Brejová
چکیده

DNA sequencing data is typically a large collection of short strings called reads. We can summarize such data by computing a histogram of the number of occurrences of substrings of a fixed length. Such histograms can be used for example to estimate the size of a genome. In this paper, we study a recent tool, Kmerlight, which computes approximate histograms. We discover an approximation bias, and we propose a new, unbiased version of Kmerlight. We also model the distribution of approximation errors and support our theoretical model by experimental data. Finally, we use another tool, CovEst, to compute genome size estimates from approximate histograms. Our results show that although CovEst was designed to work with exact histograms, it can be used with their approximate versions, which can be produced in a much smaller memory.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Informed and automated k-mer size selection for genome assembly

MOTIVATION Genome assembly tools based on the de Bruijn graph framework rely on a parameter k, which represents a trade-off between several competing effects that are difficult to quantify. There is currently a lack of tools that would automatically estimate the best k to use and/or quickly generate histograms of k-mer abundances that would allow the user to make an informed decision. RESULTS...

متن کامل

Kmerlight: fast and accurate k-mer abundance estimation

k-mers (nucleotide strings of length k) form the basis of several algorithms in computational genomics. In particular, k-mer abundance information in sequence data is useful in read error correction, parameter estimation for genome assembly, digital normalization etc. We give a streaming algorithm Kmerlight for computing the k-mer abundance histogram from sequence data. Our algorithm is fast an...

متن کامل

Approximate Query Answering using Histograms

Answering queries approximately has recently been proposed as a way to reduce query response times in on-line decision support systems, when the precise answer is not necessary or early feedback is helpful. In this article, we explore the use of precomputed histograms for approximate answering of aggregate queries. Histograms are used by most database systems for selectivity estimation within t...

متن کامل

Approximate Query Answering Using Histograms Bulletin of the Ieee Computer Society Technical Committee on Data Engineering

Answering queries approximately has recently been proposed as a way to reduce query response times in on-line decision support systems, when the precise answer is not necessary or early feedback is helpful. In this article, we explore the use of precomputed histograms for approximate answering of aggregate queries. Histograms are used by most database systems for selectivity estimation within t...

متن کامل

Letter from the Editor - in - Chief Changing

Answering queries approximately has recently been proposed as a way to reduce query response times in on-line decision support systems, when the precise answer is not necessary or early feedback is helpful. In this article, we explore the use of precomputed histograms for approximate answering of aggregate queries. Histograms are used by most database systems for selectivity estimation within t...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2017